
5.6.3 Token-Wise Clipping

The token-wise clipping efficiently finds a suitable clipping range that achieves minimal final quantization loss in a coarse-to-fine procedure. At the coarse-grained stage, leveraging the fact that the less important outliers belong to only a few tokens, the authors propose to obtain a preliminary clipping range quickly in a token-wise manner. In particular, this stage aims to quickly skip over the area where clipping has little influence on accuracy. According to the second finding, the long-tail area corresponds to only a few tokens. Therefore, the maximum value of the embedding of a token can serve as its representative, and likewise the minimum value can represent the negative outliers. Then, two new tensors with $T$ elements each can be constructed by collecting these extreme values for every token:

$$
O_u = \{\max(\mathrm{token}_1), \max(\mathrm{token}_2), \ldots, \max(\mathrm{token}_T)\}, \qquad
O_l = \{\min(\mathrm{token}_1), \min(\mathrm{token}_2), \ldots, \min(\mathrm{token}_T)\},
\tag{5.15}
$$

where $O_u$ denotes the collection of upper bounds and $O_l$ the collection of lower bounds. The clipping values are determined by:

$$
c_u = \mathrm{quantile}(O_u, \alpha), \qquad
c_l = \mathrm{quantile}(O_l, \alpha),
\tag{5.16}
$$

where $\mathrm{quantile}(\cdot, \alpha)$ is the quantile function that computes the $\alpha$-th quantile of its input. An $\alpha$ that minimizes the final loss is found by grid search. The authors adopt a uniform quantizer; thus, given the bit-width $b$, an initial step size $s_0$ of the uniform quantizer can be computed from $c_u$ and $c_l$ as $s_0 = \frac{c_u - c_l}{2^b - 1}$.
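As an illustration, the coarse-grained stage can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the function names, the candidate grid of $\alpha$ values, and the proxy `loss_fn` used to score each candidate are hypothetical and not the authors' implementation.

```python
import numpy as np

def coarse_clipping_range(x, alpha, bit_width):
    """Coarse-grained stage for activations x of shape (T, d): T tokens, d dims."""
    # One representative per token, Eq. (5.15): per-token max for positive
    # outliers and per-token min for negative outliers.
    O_u = x.max(axis=1)
    O_l = x.min(axis=1)
    # Clipping values as alpha-quantiles of the representatives, Eq. (5.16).
    c_u = np.quantile(O_u, alpha)
    c_l = np.quantile(O_l, alpha)
    # Initial step size of the b-bit uniform quantizer.
    s0 = (c_u - c_l) / (2 ** bit_width - 1)
    return c_u, c_l, s0

def grid_search_alpha(x, bit_width, loss_fn, alphas=np.linspace(0.9, 1.0, 21)):
    """Return the alpha whose induced clipping range minimizes loss_fn(x, c_u, c_l, s0)."""
    return min(alphas, key=lambda a: loss_fn(x, *coarse_clipping_range(x, a, bit_width)))
```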

At the fine-grained stage, the preliminary clipping range is optimized to obtain a better result. The aim is to make fine-grained adjustments in the critical area to further guarantee the final performance. In detail, the step size $s_0$ obtained at the coarse-grained stage is adopted as initialization. Then, gradient descent is used to update the step size $s$ with respect to the final loss $\mathcal{L}$ with learning rate $\eta$:

$$
s = s - \eta \frac{\partial \mathcal{L}}{\partial s}.
\tag{5.17}
$$
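A minimal PyTorch sketch of this update is shown below; the `quantization_loss` callable is an assumption that stands in for whatever fake-quantized forward pass produces the final loss $\mathcal{L}$ as a differentiable function of $s$ (e.g., via a straight-through estimator).

```python
import torch

def refine_step_size(quantization_loss, s0, lr=1e-3, num_iters=200):
    """Fine-grained stage: refine the coarse step size s0 by gradient descent,
    following Eq. (5.17)."""
    s = torch.tensor(float(s0), requires_grad=True)
    for _ in range(num_iters):
        loss = quantization_loss(s)      # final loss L as a function of s
        loss.backward()                  # compute dL/ds
        with torch.no_grad():
            s -= lr * s.grad             # s <- s - eta * dL/ds
        s.grad.zero_()
    return s.detach().item()
```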

Because the wide range of outliers corresponds to only a few tokens, passing over the unimportant area from the token perspective (the coarse-grained stage) requires far fewer iterations than doing so from the value perspective (the fine-grained stage). The design of the two stages adequately exploits this property and thus leads to high efficiency.

5.7 BinaryBERT: Pushing the Limit of BERT Quantization

Bai et al. [6] established the pioneering work on binary BERT pre-trained models. They first studied the potential rationales behind the sharp performance drop from ternarization to binarization of BERT. They began by comparing the loss landscapes of full-precision, ternary, and binary BERT models. In detail, the parameters $W_1$ and $W_2$ from the value layers of multi-head attention in the first two transformer layers are perturbed as follows:

$$
\tilde{W}_1 = W_1 + x \cdot \mathbf{1}_x, \qquad
\tilde{W}_2 = W_2 + y \cdot \mathbf{1}_y,
\tag{5.18}
$$
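Such a two-dimensional loss surface can be traced by evaluating the loss on a grid of $(x, y)$ perturbations. The sketch below assumes a hypothetical `eval_loss(model)` callable and hypothetical names for the two value-layer weight parameters; it only illustrates the perturbation of Eq. (5.18), not the authors' exact visualization code.

```python
import numpy as np
import torch

def loss_landscape_2d(model, eval_loss, w1_name, w2_name, span=1.0, steps=11):
    """Evaluate the loss surface under the perturbations of Eq. (5.18):
    W1 + x * 1 and W2 + y * 1 over a (steps x steps) grid of (x, y)."""
    params = dict(model.named_parameters())
    w1, w2 = params[w1_name], params[w2_name]
    w1_orig, w2_orig = w1.detach().clone(), w2.detach().clone()

    xs = np.linspace(-span, span, steps)
    ys = np.linspace(-span, span, steps)
    surface = np.zeros((steps, steps))
    with torch.no_grad():
        for i, x in enumerate(xs):
            for j, y in enumerate(ys):
                w1.copy_(w1_orig + float(x) * torch.ones_like(w1_orig))  # W1 + x * 1_x
                w2.copy_(w2_orig + float(y) * torch.ones_like(w2_orig))  # W2 + y * 1_y
                surface[i, j] = float(eval_loss(model))
        w1.copy_(w1_orig)  # restore the original parameters
        w2.copy_(w2_orig)
    return surface
```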